<!--begin.rcode results='hide', echo=FALSE, message=FALSE
library(caret)
data(BloodBrain)
library(proxy)
library(latticeExtra)
library(animation)
library(ggplot2)

theme1 <- caretTheme()
theme1$superpose.symbol$col = c(rgb(1, 0, 0, .4), rgb(0, 0, 1, .4), 
  rgb(0.3984375, 0.7578125,0.6445312, .6))
theme1$superpose.symbol$pch = c(15, 16, 17)
theme1$superpose.cex = 1
theme1$superpose.line$col = c(rgb(1, 0, 0, .9), rgb(0, 0, 1, .9), rgb(0.3984375, 0.7578125,0.6445312, .6))
theme1$superpose.line$lwd <- 2
theme1$superpose.line$lty = 1:3
theme1$plot.symbol$col = c(rgb(.2, .2, .2, .4))
theme1$plot.symbol$pch = 16
theme1$plot.cex = 1
theme1$plot.line$col = c(rgb(1, 0, 0, .7))
theme1$plot.line$lwd <- 2
theme1$plot.line$lty = 1

hook_inline = knit_hooks$get('inline')
knit_hooks$set(inline = function(x) {
  if (is.character(x)) highr::hi_html(x) else hook_inline(x)
  })
opts_chunk$set(comment=NA)

session <- paste(format(Sys.time(), "%a %b %d %Y"),
                 "using caret version",
                 packageDescription("caret")$Version,
                 "and",
                 R.Version()$version.string)
    end.rcode-->

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
  <!--
  Design by Free CSS Templates
http://www.freecsstemplates.org
Released for free under a Creative Commons Attribution 2.5 License

Name       : Emerald 
Description: A two-column, fixed-width design with dark color scheme.
Version    : 1.0
Released   : 20120902

-->
  <html xmlns="http://www.w3.org/1999/xhtml">
  <head>
  <meta name="keywords" content="" />
  <meta name="description" content="" />
  <meta http-equiv="content-type" content="text/html; charset=utf-8" />
  <title>Data Splitting</title>
  <link href='http://fonts.googleapis.com/css?family=Abel' rel='stylesheet' type='text/css'>
  <link href="style.css" rel="stylesheet" type="text/css" media="screen" />
  </head>
  <body>
  <div id="wrapper">
  <div id="header-wrapper" class="container">
  <div id="header" class="container">
  <div id="logo">
  <h1><a href="#">Data Splitting</a></h1>
</div>
  <!--
  <div id="menu">
  <ul>
  <li class="current_page_item"><a href="#">Homepage</a></li>
<li><a href="#">Blog</a></li>
<li><a href="#">Photos</a></li>
<li><a href="#">About</a></li>
<li><a href="#">Contact</a></li>
</ul>
  </div>
  -->
  </div>
  <div><img src="images/img03.png" width="1000" height="40" alt="" /></div>
  </div>
  <!-- end #header -->
<div id="page">
  <div id="content">
  
<h1>Contents</h1>  
<ul>
  <li><a href="#outcome">Simple Splitting Based on the Outcome</a></li>
  <li><a href="#predictors">Splitting Based on the Predictors</a></li>
  <li><a href="#time">Data Splitting for Time Series</a></li> 
 </ul>   
      
 <div id="outcome"></div>        
<h1>Simple Splitting Based on the Outcome</h1>

<p>The function <span class="mx funCall">createDataPartition</span> can be used to create balanced splits of the data. If the <span class="mx arg">y</span> argument to this function is a factor, the random sampling occurs within each class and should preserve the overall class distribution of the data. For example, to create a single 80\20% split of the iris data:</p>

<!--begin.rcode split1,tidy=FALSE
library(caret)
set.seed(3456)
trainIndex <- createDataPartition(iris$Species, p = .8, 
                                  list = FALSE, 
                                  times = 1)
head(trainIndex)

irisTrain <- iris[ trainIndex,]
irisTest  <- iris[-trainIndex,]
    end.rcode-->

<p>The <span class="mx arg">list</span><tt> = FALSE</tt> avoids returns the data as a list. This function also has an argument, <span class="mx arg">times</span>, that can create multiple splits at once; the data indices are returned in a list of integer vectors. Similarly, <span class="mx funCall">createResample</span> can be used to make simple bootstrap samples and <span class="mx funCall">createFolds</span> can be used to generate balanced cross&ndash;validation groupings from a set of data.</p>

<div id="predictors"></div> 
<h1>Splitting Based on the Predictors </h1>

<p>Also, the function <span class="mx funCall">maxDissim</span> can be used to create sub&ndash;samples using a maximum dissimilarity approach (<a href="http://www.liebertonline.com/doi/abs/10.1089/106652799318382">Willett, 1999</a>). Suppose there is a data set <i>A</i> with <i>m</i> samples and a larger data set <i>B</i> with <i>n</i> samples. We may want to create a sub&ndash;sample from <i>B</i> that is diverse when compared to <i>A</i>. To do this, for each sample in <i>B</i>, the function calculates the <i>m</i> dissimilarities between each point in <i>A</i>. The most dissimilar point in <i>B</i> is added to <i>A</i> and the process continues. There are many methods in R to calculate dissimilarity. <a href="http://cran.r-project.org/web/packages/caret/index.html"><strong>caret</strong></a> uses the <a href="http://cran.r-project.org/web/packages/proxy/index.html"><strong>proxy</strong></a> package. See the manual for that package for a list of available measures. Also, there are many ways to calculate which sample is &ldquo;most dissimilar&rdquo;. The argument <span class="mx arg">obj</span> can be used to specify any function that returns a scalar measure. <a href="http://cran.r-project.org/web/packages/caret/index.html"><strong>caret</strong></a> includes two functions, <span class="mx funCall">minDiss</span> and <span class="mx funCall">sumDiss</span>, that can be used to maximize the minimum and total dissimilarities, respectfully.</p>

<p>As an example, the figure below shows a scatter plot of two chemical descriptors for the Cox2 data. Using an initial random sample of 5 compounds, we can select 20 more compounds from the data so that the new compounds are most dissimilar from the initial 5 that were specified. The panels in the figure show the results using several combinations of distance metrics and scoring functions. For these data, the distance measure has less of an impact than the scoring method for determining which compounds are most dissimilar.</p>

<!--begin.rcode SplitMaxDiss1
library(mlbench)
data(BostonHousing)

testing <- scale(BostonHousing[, c("age", "nox")])
set.seed(5)
## A random sample of 5 data points
startSet <- sample(1:dim(testing)[1], 5)
samplePool <- testing[-startSet,]
start <- testing[startSet,]
newSamp <- maxDissim(start, samplePool, n = 20)
head(newSamp)
    end.rcode--> 

<!--begin.rcode SplitGIF, results='hide', echo=FALSE, message=FALSE, fig.show='hide', eval = FALSE
testing <- scale(BostonHousing[, c("age", "nox")])
testing <- as.data.frame(testing)
testing$index <- 1:nrow(testing)

set.seed(5)
## A random sample of 5 data points
startSet <- sample(1:dim(testing)[1], 5)
samplePool <- testing[-startSet,]
start <- testing[startSet,]

newSamp <- maxDissim(as.matrix(start[, 1:2]), 
                     as.matrix(samplePool[, 1:2]), 
                     n = 20)
head(newSamp)
new_index <- samplePool$index[newSamp]

plot_data <- BostonHousing[, c("age", "nox")]
plot_data$Label <- "Pool"
plot_data$Label[start$index] <- "Starting Set"
plot_data$Label <- factor(as.character(plot_data$Label), 
                          levels = c("Pool", "Starting Set", "Sampled"))

library(ggplot2)
cbPalette <- c("#999999", "#0072B2", "#D55E00")

library(animation)

ani.options(ani.width = 550,  outdir = getwd(), autoplay = FALSE)

ani.record(reset = TRUE)

for(i in 1:5) {
  print(ggplot(plot_data, aes(x = age, y = nox, color = Label, size = Label)) + 
          geom_point() +
          scale_colour_manual(values=cbPalette) + 
          ggtitle(paste("Data pool with random starting samples ", 
                        paste(rep(".", 5-i+1), collapse = ""))))
  ani.record()
}

for(i in 1:length(newSamp)) {
  tmp <- plot_data
  tmp$Label[new_index[1:i]] <- "Sampled"
  tmp$Label <- factor(as.character(tmp$Label), levels = c("Pool", "Starting Set", "Sampled"))
  fname <- gsub(" ", "0", format((1:length(newSamp))+1)[i])
  

  print(ggplot(tmp, aes(x = age, y = nox, color = Label, size = Label)) + 
          geom_point() + scale_colour_manual(values=cbPalette) + 
          ggtitle(paste("Adding sample", i)))
  ani.record()
}

for(i in 3:1) {
  print(ggplot(tmp, aes(x = age, y = nox, color = Label, size = Label)) + 
          geom_point() + scale_colour_manual(values=cbPalette) + 
          ggtitle(paste("Restarting in ", i)))
  ani.record()
}


saveGIF(ani.replay(), img.name = "record_plot", movie.name = "MaxDissim.gif")
    end.rcode--> 
    
<p>The visualization below shows the data set (small points), the starting samples
(larger blue points) and the order in which the other 20 samples are added. </p>


<div class="chunk" id="Split_gif"><div class="rimage default"><img src="MaxDissim.gif" class="plot" /></div></div>


 <div id="time"></div>        
<h1>Data Splitting for Time Series</h1>

<p>
Simple random sampling of time series is probably not the best way to resample times series data. <a href="https://www.otexts.org/fpp/2/5">Hyndman and Athanasopoulos (2013))</a> discuss <i>rolling forecasting origin</i> techniques that move the training and test sets in time. caret contains a function called <span class="mx funCall">createTimeSlices</span> that can create the indices for this type of splitting. 
</p>
<p>
The three parameters for this type of splitting are:
</p>

<ul>
  <li><span class="mx funCall">initialWindow</span>: the initial number of consecutive values in each training set sample</li>
  <li><span class="mx funCall">horizon</span>: The number of consecutive values in test set sample</li>
  <li><span class="mx funCall">fixedWindow</span>: A logical: if FALSE, the training set always start at the first sample and the training set size will vary over data splits.</li> 
 </ul> 
<p>
As an example, suppose we have a time series with 20 data points. We can fix <tt><!--rinline 'initialWindow = 5'  --></tt> and look at different settings of the other two arguments. In the plot below, rows in each panel correspond to different data splits (i.e. resamples) and the columns correspond to different data points. Also, red indicates samples that are in included in the training set and the blue indicates samples in the test set. 
</p>
<!--begin.rcode Split_time, echo=FALSE, fig.width=9
times <- 1:20

case1Index <- createTimeSlices(times, initialWindow = 5, fixedWindow = FALSE)
case1 <- expand.grid(time = times, resample = seq(along = case1Index$train))
case1$Group <- NA
case1$WindowSize <- "horizon = 1"
case1$Fixed <- "fixedWindow = FALSE"
for(i in seq(along = case1Index$train)) case1[case1$resample == i & case1$time %in% case1Index$train[[i]], "Group"] <- -1
for(i in seq(along = case1Index$test)) case1[case1$resample == i & case1$time %in% case1Index$test[[i]], "Group"] <- 1

case2Index <- createTimeSlices(times, initialWindow = 5, fixedWindow = TRUE)
case2 <- expand.grid(time = times, resample = seq(along = case2Index$train))
case2$Group <- NA
case2$WindowSize <- "horizon = 1"
case2$Fixed <- "fixedWindow = TRUE"
for(i in seq(along = case2Index$train)) case2[case2$resample == i & case2$time %in% case2Index$train[[i]], "Group"] <- -1
for(i in seq(along = case2Index$test)) case2[case2$resample == i & case2$time %in% case2Index$test[[i]], "Group"] <- 1


case3Index <- createTimeSlices(times, initialWindow = 5, fixedWindow = FALSE, horizon = 5)
case3 <- expand.grid(time = times, resample = seq(along = case3Index$train))
case3$Group <- NA
case3$WindowSize <- "horizon = 5"
case3$Fixed <- "fixedWindow = FALSE"
for(i in seq(along = case3Index$train)) case3[case3$resample == i & case3$time %in% case3Index$train[[i]], "Group"] <- -1
for(i in seq(along = case3Index$test)) case3[case3$resample == i & case3$time %in% case3Index$test[[i]], "Group"] <- 1

case4Index <- createTimeSlices(times, initialWindow = 5, fixedWindow = TRUE, horizon = 5)
case4 <- expand.grid(time = times, resample = seq(along = case4Index$train))
case4$Group <- NA
case4$WindowSize <- "horizon = 5"
case4$Fixed <- "fixedWindow = TRUE"
for(i in seq(along = case4Index$train)) case4[case4$resample == i & case4$time %in% case4Index$train[[i]], "Group"] <- -1
for(i in seq(along = case4Index$test)) case4[case4$resample == i & case4$time %in% case4Index$test[[i]], "Group"] <- 1

plotdata <- rbind(case1, case2, case3, case4)

useOuterStrips(
  levelplot(Group ~ factor(time)*factor(resample)|WindowSize*Fixed, data = plotdata,  
          colorkey = FALSE,
          col.regions = c("#FBB4AE", "#B3CDE3"),
          at = c(-Inf, 0, Inf),
            xlab = "Time Point",
            ylab = "Resample Number")) 
    end.rcode-->  
    
    
<div style="clear: both;">&nbsp;</div>
  </div>
  <!-- end #content -->
<div id="sidebar">
<ul>
  <li>
    <h2>General Topics</h2>
    <ul>
      <li><a href="index.html">Front Page</a></li>
      <li><a href="visualizations.html">Visualizations</a></li>
      <li><a href="preprocess.html">Pre-Processing</a><li>
      <li><a href="splitting.html">Data Splitting</a></li>
      <li><a href="varimp.html">Variable Importance</a></li>
      <li><a href="other.html">Model Performance</a></li>
      <li><a href="parallel.html">Parallel Processing</a></li>
    </ul>
    <h2>Model Training and Tuning</h2>
    <ul>
      <li><a href="training.html">Basic Syntax</a></li>
      <li><a href="modelList.html">Sortable Model List</a></li>
      <li><a href="bytag.html">Models By Tag</a></li>
      <li><a href="similarity.html">Models By Similarity</a></li>
      <li><a href="custom_models.html">Using Custom Models</a></li>
      <li><a href="sampling.html">Sampling for Class Imbalances</a></li> 
      <li><a href="random.html">Random Search</a></li> 
      <li><a href="adaptive.html">Adaptive Resampling</a></li> 
    </ul>
    <h2>Feature Selection</h2>
    <ul>
      <li><a href="featureselection.html">Overview</a>
      <li><a href="rfe.html">RFE</a></li>
      <li><a href="filters.html">Filters</a></li>
      <li><a href="GA.html">GA</a></li>
      <li><a href="SA.html">SA</a></li>
    </ul>  
  </li>
</ul>
</div>
<!-- end #sidebar -->
<div style="clear: both;">&nbsp;</div>
  </div>
  <div class="container"><img src="images/img03.png" width="1000" height="40" alt="" /></div>
  <!-- end #page -->
</div>
  <div id="footer-content"></div>
  <div id="footer">
  <!--begin.rcode echo = FALSE
    knit_hooks$set(inline = hook_inline)    
    end.rcode-->   
  <p>Created on <!--rinline I(session) -->.</p>
  </div>
  <!-- end #footer -->
</body>
  </html>
